AITopics | visual entity

Collaborating Authors

visual entity

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

Hu, Wenbo, Gu, Jia-Chen, Dou, Zi-Yi, Fayyaz, Mohsen, Lu, Pan, Chang, Kai-Wei, Peng, Nanyun

arXiv.org Artificial IntelligenceOct-10-2024

Existing multimodal retrieval benchmarks primarily focus on evaluating whether models can retrieve and utilize external textual knowledge for question answering. However, there are scenarios where retrieving visual information is either more beneficial or easier to access than textual data. In this paper, we introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench, in which we systematically identify and categorize scenarios where visually augmented knowledge is better than textual knowledge, for instance, more images from varying viewpoints. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios. With MRAG-Bench, we conduct an evaluation of 10 open-source and 4 proprietary large vision-language models (LVLMs). Our results show that all LVLMs exhibit greater improvements when augmented with images compared to textual knowledge, confirming that MRAG-Bench is vision-centric. Additionally, we conduct extensive analysis with MRAG-Bench, which offers valuable insights into retrieval-augmented LVLMs. Notably, the top-performing model, GPT-4o, faces challenges in effectively leveraging retrieved knowledge, achieving only a 5.82% improvement with ground-truth information, in contrast to a 33.16% improvement observed in human participants. These findings highlight the importance of MRAG-Bench in encouraging the community to enhance LVLMs' ability to utilize retrieved visual knowledge more effectively.

knowledge, oxidation, young kitten image, (14 more...)

arXiv.org Artificial Intelligence

2410.08182

Country:

Europe > France > Île-de-France > Paris > Paris (0.14)
Asia > Myanmar (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.05)
(14 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)
Automobiles & Trucks > Manufacturer (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection

Park, Kwanyong, Saito, Kuniaki, Kim, Donghyun

arXiv.org Artificial IntelligenceJul-21-2024

Vision-language (VL) models often exhibit a limited understanding of complex expressions of visual objects (e.g., attributes, shapes, and their relations), given complex and diverse language queries. Traditional approaches attempt to improve VL models using hard negative synthetic text, but their effectiveness is limited. In this paper, we harness the exceptional compositional understanding capabilities of generative foundational models. We introduce a novel method for structured synthetic data generation aimed at enhancing the compositional understanding of VL models in language-based object detection. Our framework generates densely paired positive and negative triplets (image, text descriptions, and bounding boxes) in both image and text domains. By leveraging these synthetic triplets, we transform 'weaker' VL models into 'stronger' models in terms of compositional understanding, a process we call "Weak-to-Strong Compositional Learning" (WSCL). To achieve this, we propose a new compositional contrastive learning formulation that discovers semantics and structures in complex descriptions from synthetic triplets. As a result, VL models trained with our synthetic data generation exhibit a significant performance boost in the Omnilabel benchmark by up to +5AP and the D3 benchmark by +6.9AP upon existing baselines.

arxiv preprint arxiv, detection, triplet, (9 more...)

arXiv.org Artificial Intelligence

2407.15296

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Netherlands > North Holland > Amsterdam (0.04)

Genre:

Research Report > Promising Solution (0.48)
Research Report > New Finding (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

Image Parsing via Stochastic Scene Grammar

Neural Information Processing SystemsMar-15-2024, 00:46:22 GMT

This paper proposes a parsing algorithm for scene understanding which includes four aspects: computing 3D scene layout, detecting 3D objects (e.g.

contextual relation, production rule, relation, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.29)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?

Chen, Yang, Hu, Hexiang, Luan, Yi, Sun, Haitian, Changpinyo, Soravit, Ritter, Alan, Chang, Ming-Wei

arXiv.org Artificial IntelligenceOct-17-2023

Pre-trained vision and language models have demonstrated state-of-the-art capabilities over existing tasks involving images and texts, including visual question answering. However, it remains unclear whether these models possess the capability to answer questions that are not only querying visual content but knowledge-intensive and information-seeking. In this study, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with only common sense knowledge. Using InfoSeek, we analyze various pre-trained visual question answering models and gain insights into their characteristics. Our findings reveal that state-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.) face challenges in answering visual information-seeking questions, but fine-tuning on the InfoSeek dataset elicits models to use fine-grained knowledge that was learned during their pre-training. Furthermore, we show that accurate visual entity recognition can be used to improve performance on InfoSeek by retrieving relevant documents, showing a significant space for improvement.

dataset, knowledge, visual entity, (14 more...)

arXiv.org Artificial Intelligence

2302.11713

Country:

North America > United States > Washington > King County > Seattle (0.04)
Europe > France > Bourgogne-Franche-Comté > Doubs > Besançon (0.04)
Asia > Armenia (0.04)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.76)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Modeling the Dashboard Provenance

Jarske, Johne, Rady, Jorge, Filgueiras, Lucia V. L., Velloso, Leandro M., Santos, Tania L.

arXiv.org Artificial IntelligenceSep-16-2023

Organizations of all kinds, whether public or private, profit-driven or non-profit, and across various industries and sectors, rely on dashboards for effective data visualization. However, the reliability and efficacy of these dashboards rely on the quality of the visual and data they present. Studies show that less than a quarter of dashboards provide information about their sources, which is just one of the expected metadata when provenance is seriously considered. Provenance is a record that describes people, organizations, entities, and activities that had a role in the production, influence, or delivery of a piece of data or an object. This paper aims to provide a provenance representation model, that entitles standardization, modeling, generation, capture, and visualization, specifically designed for dashboards and its visual and data components. The proposed model will offer a comprehensive set of essential provenance metadata that enables users to evaluate the quality, consistency, and reliability of the information presented on dashboards. This will allow a clear and precise understanding of the context in which a specific dashboard was developed, ultimately leading to better decision-making.

dashboard, information, provenance, (15 more...)

arXiv.org Artificial Intelligence

2308.06788

Country:

South America > Brazil > São Paulo (0.04)
South America > Brazil > Federal District > Brasília (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > Florida (0.04)

Genre: Research Report (0.64)

Industry:

Government > Regional Government (0.93)
Health & Medicine > Therapeutic Area > Immunology (0.33)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Information Management (0.95)
Information Technology > Software (0.68)

Add feedback

Image Parsing with Stochastic Scene Grammar

Neural Information Processing SystemsApr-6-2023, 12:52:01 GMT

In contrast to previous scene labeling work that applied discriminative classifiers to pixels (or super-pixels), we use a generative Stochastic Scene Grammar (SSG). This grammar represents the compositional structures of visual entities from scene categories, 3D foreground/background, 2D faces, to 1D lines. The grammar includes three types of production rules and two types of contextual relations. Production rules: (i) AND rules represent the decomposition of an entity into sub-parts; (ii) OR rules represent the switching among sub-types of an entity; (iii) SET rules rep- resent an ensemble of visual entities. Contextual relations: (i) Cooperative " " relations represent positive links between binding entities, such as hinged faces of a object or aligned boxes; (ii) Competitive "-" relations represents negative links between competing entities, such as mutually exclusive boxes. We design an efficient MCMC inference algorithm, namely Hierarchical cluster sampling, to search in the large solution space of scene configurations. The algorithm has two stages: (i) Clustering: It forms all possible higher-level structures (clusters) from lower-level entities by production rules and contextual relations.

algorithm, image parsing, stochastic scene grammar, (7 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.85)
Information Technology > Artificial Intelligence > Machine Learning (0.58)

Add feedback

Visual Named Entity Linking: A New Dataset and A Baseline

Sun, Wenxiang, Fan, Yixing, Guo, Jiafeng, Zhang, Ruqing, Cheng, Xueqi

arXiv.org Artificial IntelligenceNov-9-2022

Visual Entity Linking (VEL) is a task to link regions of images with their corresponding entities in Knowledge Bases (KBs), which is beneficial for many computer vision tasks such as image retrieval, image caption, and visual question answering. While existing tasks in VEL either rely on textual data to complement a multi-modal linking or only link objects with general entities, which fails to perform named entity linking on large amounts of image data. In this paper, we consider a purely Visual-based Named Entity Linking (VNEL) task, where the input only consists of an image. The task is to identify objects of interest (i.e., visual entity mentions) in images and link them to corresponding named entities in KBs. Since each entity often contains rich visual and textual information in KBs, we thus propose three different sub-tasks, i.e., visual to visual entity linking (V2VEL), visual to textual entity linking (V2TEL), and visual to visual-textual entity linking (V2VTEL). In addition, we present a high-quality human-annotated visual person linking dataset, named WIKIPerson. Based on WIKIPerson, we establish a series of baseline algorithms for the solution of each sub-task, and conduct experiments to verify the quality of proposed datasets and the effectiveness of baseline methods. We envision this work to be helpful for soliciting more works regarding VNEL in the future. The codes and datasets are publicly available at https://github.com/ict-bigdatalab/VNEL.

artificial intelligence, information, natural language, (17 more...)

arXiv.org Artificial Intelligence

2211.04872

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia > China > Beijing > Beijing (0.04)
North America > United States > New York (0.04)
(4 more...)

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment > Sports (1.00)
Media (0.68)
Government > Regional Government > North America Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)

Add feedback

COFAR: Commonsense and Factual Reasoning in Image Search

Gatti, Prajwal, Penamakuri, Abhirama Subramanyam, Teotia, Revant, Mishra, Anand, Sengupta, Shubhashis, Ramnani, Roshni

arXiv.org Artificial IntelligenceOct-16-2022

One characteristic that makes humans superior to modern artificially intelligent models is the ability to interpret images beyond what is visually apparent. Consider the following two natural language search queries - (i) "a queue of customers patiently waiting to buy ice cream" and (ii) "a queue of tourists going to see a famous Mughal architecture in India." Interpreting these queries requires one to reason with (i) Commonsense such as interpreting people as customers or tourists, actions as waiting to buy or going to see; and (ii) Fact or world knowledge associated with named visual entities, for example, whether the store in the image sells ice cream or whether the landmark in the image is a Mughal architecture located in India. Such reasoning goes beyond just visual recognition. To enable both commonsense and factual reasoning in the image search, we present a unified framework, namely Knowledge Retrieval-Augmented Multimodal Transformer (KRAMT), that treats the named visual entities in an image as a gateway to encyclopedic knowledge and leverages them along with natural language query to ground relevant knowledge. Further, KRAMT seamlessly integrates visual content and grounded knowledge to learn alignment between images and search queries. This unified framework is then used to perform image search requiring commonsense and factual reasoning. The retrieval performance of KRAMT is evaluated and compared with related approaches on a new dataset we introduce - namely COFAR. We make our code and dataset available at https://vl2g.github.io/projects/cofar

machine learning, natural language, pattern recognition, (20 more...)

arXiv.org Artificial Intelligence

2210.08554

Country:

Asia > India (0.44)
South America > Argentina (0.04)
North America > Canada (0.04)
(4 more...)

Genre: Research Report (0.50)

Industry:

Consumer Products & Services > Food, Beverage, Tobacco & Cannabis (1.00)
Retail (0.93)
Leisure & Entertainment > Sports (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Image Matching (0.82)

Add feedback

Image Parsing with Stochastic Scene Grammar

Zhao, Yibiao, Zhu, Song-chun

Neural Information Processing SystemsFeb-14-2020, 21:27:18 GMT

algorithm, image parsing, stochastic scene grammar, (7 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.85)
Information Technology > Artificial Intelligence > Machine Learning (0.62)

Add feedback

Graph-Structured Visual Imitation

Sieb, Maximilian, Xian, Zhou, Huang, Audrey, Kroemer, Oliver, Fragkiadaki, Katerina

arXiv.org Artificial IntelligenceJul-11-2019

We cast visual imitation as a visual correspondence problem. Our robotic agent is rewarded when its actions result in better matching of relative spatial configurations for corresponding visual entities detected in its workspace and teacher's demonstration. We build upon recent advances in Computer Vision,such as human finger keypoint detectors, object detectors trained on-the-fly with synthetic augmentations, and point detectors supervised by viewpoint changes and learn multiple visual entity detectors for each demonstration without human annotations or robot interactions. We empirically show the proposed factorized visual representations of entities and their spatial arrangements drive successful imitation of a variety of manipulation skills within minutes, using a single demonstration and without any environment instrumentation. It is robust to background clutter and can effectively generalize across environment variations between demonstrator and imitator, greatly outperforming unstructured non-factorized full-frame CNN encodings of previous works.

artificial intelligence, detector, machine learning, (16 more...)

arXiv.org Artificial Intelligence

1907.05518

Country: North America > United States (0.46)

Genre: Research Report (0.66)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback